ABSTRACT

Motivation: Identifying the molecular pathways more prone to disruption during a pathological process is a key task in network medicine and,
more generally, in systems biology.

Results: In this work we propose a pipeline that couples a machine learning solution for molecular profiling with a recent network comparison
method. The pipeline can identify changes occurring between specific sub-modules of networks built in a case-control biomarker study,
discriminating key groups of genes whose interactions are modified by an underlying condition. The proposal is independent of the
classification algorithm used. Three applications on genome-wide data are presented, regarding children's susceptibility to air pollution and two
neurodegenerative diseases: Parkinson's and Alzheimer's.

Availability: Details about the software used for the experiments discussed in this paper are provided in the Appendix. 
Contact: furlan@fbk.eu 

©2011 Baria et al

1 INTRODUCTION

It is now widely accepted that most known diseases are of systemic nature: their phenotypes can be attributed to the breakdown of a rather
complex set of molecular interactions among the cell's components rather than imputed to the malfunctioning of a single entity such as a
gene. A major aim of systems biology and, in particular, of its newly emerging discipline known as network medicine (Barabasi et al., 2011),
is the understanding of the cellular wiring diagram at all possible levels of organization (from transcriptomics to signalling) of the
functional design, the molecular pathways being a typical example. Such reconstruction is made feasible by the recent advances in the theory
of complex networks (e.g. Strogatz, 2001; Newman, 2003; Boccaletti et al., 2006; Newman, 2010; Buchanan et al., 2010) and, in particular, in
the reconstruction algorithms for inferring network topology and wiring starting from a collection of high-throughput measurements (He et al.,
2009). However, the tackled problem is hard ("a daunting task", Baralla et al., 2009) and these methods are not flawless (Marbach et al.,
2010), due to many factors. Among them, underdeterminacy is a major issue (De Smet and Marchal, 2010), and the ratio between the network
dimension (number of nodes) and the number of available measurements to infer interactions plays a key role for the stability of the
reconstructed structure. Although some effort has recently been put into facing this issue, the stability (and thus the reproducibility) of
the process is still an open problem.

In this contribution we propose a pipeline for the machine learning driven determination of the disruption of important molecular
pathways induced by, or inducing, a condition, starting from microarray measurements in a case/control experimental design. The problem of
underdeterminacy in the inference procedure is avoided by focussing only on subnetworks, and the relevance of the studied pathways for the
disease is judged in terms of discriminative relevance for the underlying classification problem. The profiling part of the pipeline, composed of
a classifier and a feature selection method embedded within an adequate experimental procedure or Data Analysis Protocol (The MicroArray
Quality Control Consortium, 2010), is used to rank the genes with the highest discriminative power. These genes undergo an enrichment
phase (Zhang et al., 2005a; Subramanian et al., 2005) to identify the whole pathways involved, so as to keep track of the established functional
dependencies that would otherwise get lost by limiting the subnetwork analysis to the selected genes alone. Finally, a network is inferred for
both the case and the control samples on the selected pathways, and the two structures are compared to pinpoint the occurring differences and
thus to detect the relevant pathway related variations.

A noteworthy point of this workflow is its independence from its ingredients: the classifier, the feature ranking algorithm, the enrichment
procedure, the inference method and the network comparison function. This last point is worth a comment: although already fruitfully used
even in a biological context (Sharan and Ideker, 2006), the problem of quantitatively comparing networks (e.g. using a metric instead of
evaluating network properties) is a widely open issue affecting many scientific disciplines. As discussed in (Jurman et al., 2011), many
classical distances (such as those of the edit family) have a relevant drawback in being local, that is, focussing only on the portions of the
network affected by the differences in the presence/absence of matching links. Other, more recent metrics overcome this problem by
considering the global structure of the compared topologies; among such distances, the spectral ones, based on the list of eigenvalues of the
Laplacian matrix of the underlying graph, are quite interesting and, in particular, the Ipsen-Mikhailov (Ipsen and Mikhailov, 2002) distance
has been proven to be the most robust in a wide range of situations.
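
To make the idea of a spectral distance concrete, the following Python sketch compares two undirected graphs through the eigenvalues of their Laplacian matrices, in the spirit of the Ipsen-Mikhailov distance. It assumes the commonly used formulation in which each graph is summarized by a sum of Lorentzian kernels centred on the square roots of the Laplacian eigenvalues, and the distance is the L2 norm of the difference between the two densities. The function names, the kernel width gamma = 0.08 and the integration grid are illustrative assumptions, not the settings used in the experiments.

```python
import numpy as np


def _trapezoid(y, x):
    """Trapezoidal rule, written out to avoid depending on a NumPy version."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)


def laplacian_frequencies(adj):
    """Square roots of the Laplacian eigenvalues, dropping the trivial zero one."""
    adj = np.asarray(adj, dtype=float)
    lap = np.diag(adj.sum(axis=1)) - adj
    eig = np.clip(np.linalg.eigvalsh(lap), 0.0, None)  # remove tiny negative round-off
    return np.sqrt(eig[1:])  # eigvalsh returns eigenvalues in ascending order


def spectral_density(freqs, omega, gamma):
    """Sum of Lorentzian kernels centred on freqs, normalised to unit area on the grid."""
    kernels = gamma / ((omega[:, None] - freqs[None, :]) ** 2 + gamma ** 2)
    dens = kernels.sum(axis=1)
    return dens / _trapezoid(dens, omega)


def ipsen_mikhailov(adj_a, adj_b, gamma=0.08, n_points=4000):
    """L2 distance between the spectral densities of two graphs (gamma is an assumed width)."""
    fa, fb = laplacian_frequencies(adj_a), laplacian_frequencies(adj_b)
    upper = max(fa.max(), fb.max()) + 20.0 * gamma  # truncate the Lorentzian tails
    omega = np.linspace(0.0, upper, n_points)
    diff = spectral_density(fa, omega, gamma) - spectral_density(fb, omega, gamma)
    return float(np.sqrt(_trapezoid(diff ** 2, omega)))


# Toy graphs: a 3-node path and a triangle.
path3 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
triangle = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
print(ipsen_mikhailov(path3, path3))
print(ipsen_mikhailov(path3, triangle))
```

Because the whole spectrum enters the density, a single added or removed edge changes the score through its effect on the global structure, which is the property that distinguishes this family from edit-like distances.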

In what follows we will describe the newly introduced workflow in detail, providing three examples of application in problems of
biological interest: the first task concerns the transcriptomic consequences of exposure to environmental pollution on a cohort of children
in the Czech Republic, the second one investigates the molecular characteristics of Parkinson's disease (PD) at early and late stages and
the third regards the characterization of Alzheimer's disease (AD) at early and late stages. To strengthen the support to our proposal, the
problems will be dealt with by using different experimental conditions, i.e., varying the employed algorithms throughout the various
steps of the workflow. In all cases, biologically meaningful considerations can be drawn, consistent with previous findings, showing the
effectiveness of the proposed procedure in the assessment of the occurring subnetwork variations.



2 SYSTEM AND METHODS 

The proposed machine learning pipeline handles case/control transcription data through four main steps, from a profiling task output (a
ranked list of genes) to the identification of discriminant pathways, see Figure 1. Alternative algorithms can be used at each step of the
pipeline: for example, in the profiling part different classifiers, regression or feature selection methods can be adopted. In Section 6.1 we
describe the elementary steps used in the experiments.

Fig. 1. Schema of the analysis pipeline.

Formally, we are given a collection of $n$ subjects, each described by a $p$-dimensional vector $x$ of measurements. Each sample is also
associated with a phenotypical label $y \in \{1, -1\}$, assigning it to a class (e.g. pollution vs. no-pollution in Section 4.1). The dataset is
therefore represented by an $n \times p$ gene expression data matrix $X$, where $p \gg n$, and a corresponding labels vector $Y$.

The matrix $X$ is used to feed the profiling part of the pipeline. We choose a proper Data Analysis Protocol (The MicroArray Quality
Control Consortium, 2010) for ensuring accurate and reproducible results and a prediction model. The model is built as a classifier (e.g.,
SRDA, Cai et al., 2008) or a regression method (e.g., $\ell_1\ell_2$, De Mol et al., 2009) coupled with a feature selection algorithm. Thus, we obtain
a ranked list of genes from which we extract a gene signature $g_1, \ldots, g_k$ by taking the top-$k$ most discriminant genes. The choice of $k$ is
made by finding a balance between the accuracy of the classifier and the stability of the signature (The MicroArray Quality Control
Consortium, 2010).
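
The ranking step can be illustrated with a small Python sketch. It is not SRDA or $\ell_1\ell_2$: as a stand-in, it scores each gene with a simple univariate statistic (an assumption made only to keep the example self-contained), repeats the scoring over stratified bootstrap resamplings, and averages the ranks so that the top-$k$ list reflects genes that rank high consistently. All function names and the synthetic data are hypothetical.

```python
import numpy as np


def score_features(X, y):
    """Absolute standardized mean difference between the two classes, per feature."""
    pos, neg = X[y == 1], X[y == -1]
    spread = np.sqrt(pos.var(axis=0) + neg.var(axis=0)) + 1e-12
    return np.abs(pos.mean(axis=0) - neg.mean(axis=0)) / spread


def ranked_signature(X, y, k, n_resamplings=50, seed=0):
    """Average the feature ranks over stratified bootstrap resamplings; return the top-k indices."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    rank_sum = np.zeros(p)
    for _ in range(n_resamplings):
        # Resample each class separately so both labels are always present.
        idx = np.concatenate([
            rng.choice(np.flatnonzero(y == label), size=int((y == label).sum()), replace=True)
            for label in (1, -1)
        ])
        order = np.argsort(-score_features(X[idx], y[idx]))  # best feature first
        ranks = np.empty(p)
        ranks[order] = np.arange(p)
        rank_sum += ranks
    return np.argsort(rank_sum)[:k]


# Synthetic data with p >> n: only the first three features carry signal.
rng = np.random.default_rng(1)
y = np.array([1] * 15 + [-1] * 15)
X = rng.normal(size=(30, 200))
X[y == 1, :3] += 4.0
signature = ranked_signature(X, y, k=3)
print(sorted(signature.tolist()))
```

In practice the value of $k$ would be chosen by monitoring both the classifier accuracy and how much the top-$k$ list changes across resamplings, as described above.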

Applying pathway enrichment techniques (e.g., GSEA or GSA; Zhang et al., 2005a; Subramanian et al., 2005), we retrieve for each gene
$g_i$ the corresponding whole pathway $p_i = \{h_1, \ldots, h_t\}$, where the genes $h_j \neq g_i$ do not necessarily belong to the original signature $g_1, \ldots, g_k$.
Extending the analysis to all the $h_j$ genes of the pathway allows us to explore functional interactions that would otherwise get lost.
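
The expansion from signature genes to whole pathways can be sketched as a plain lookup in Python. This is only the bookkeeping part: a real enrichment tool such as GSEA also performs a statistical test, which is not reproduced here. The annotation dictionary, the pathway identifiers and the gene names below are hypothetical placeholders.

```python
def expand_to_pathways(signature, pathway_db):
    """Return, for every pathway hit by the signature, its full gene set and the hitting genes."""
    signature = set(signature)
    selected = {}
    for pathway_id, genes in pathway_db.items():
        hits = sorted(signature & set(genes))
        if hits:
            # Keep all pathway genes h_j, not only those in the signature.
            selected[pathway_id] = {"genes": sorted(set(genes)), "signature_hits": hits}
    return selected


# Hypothetical annotation: g* are signature genes, h* are other pathway members.
toy_db = {
    "PATHWAY_A": ["g1", "h1", "h2"],
    "PATHWAY_B": ["h3", "h4"],
    "PATHWAY_C": ["g2", "g1", "h5"],
}
expanded = expand_to_pathways(["g1", "g2"], toy_db)
for pathway_id, info in expanded.items():
    print(pathway_id, info["signature_hits"], info["genes"])
```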

The subnetwork inference phase (e.g., WGCN or ARACNE; Zhao et al., 2010; Meyer et al., 2008) requires reconstructing a network
for each pathway $p_i$ by using the steady state expression data of the samples of each class $y$. The network inference procedure is limited to
the genes belonging to the pathway $p_i$ in order to avoid the problem of intrinsic underdeterminacy of the task. As an additional caution
against this problem, in the following experiments we limit the analysis to pathways having more than 4 nodes and less than 1000 nodes. For
each $p_i$ and for each $y$, we obtain a real-valued adjacency matrix, which is then binarized by choosing a threshold on the correlation values.
This choice requires the construction of a binary adjacency matrix $N_{p_i,y,t_s}$ for each $p_i$, for each $y$ and for a grid of threshold values $t_1, \ldots, t_T$.
For each value $t_s$ of the grid, we compute for each $p_i$ both the distance $D$ (e.g., the Ipsen-Mikhailov distance; see Section 6.1.6 for details)
between the case and control pathway graphs and the corresponding densities. We choose the $t_s$ providing the best balance between the average
distance across the pathways $p_i$ and the network density. For a fixed $t_s$ and for each $p_i$, we obtain a score $D(N_{p_i,1,t_s}, N_{p_i,-1,t_s})$ used to
rank the pathways $p_i$. As an additional scoring indicator for $g_1, \ldots, g_k$, we also provide the difference between the weighted degree in the
control ($y = -1$) and in the patient ($y = 1$) network: $\Delta d(g_i) = d_{-1}(g_i) - d_{1}(g_i)$. A final step of biological relevance assessment of the
ranked pathways concludes the pipeline.
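
The per-class network construction, the thresholding over a grid and the degree difference $\Delta d$ can be sketched in Python as follows. The sketch uses the absolute Pearson correlation as edge weight, which is a simplification of WGCN and unrelated to ARACNE's mutual-information approach; the function names, the threshold grid and the synthetic expression data are assumptions made for illustration.

```python
import numpy as np


def correlation_network(expr):
    """Weighted adjacency: absolute Pearson correlation between genes (columns), zero diagonal."""
    weights = np.abs(np.corrcoef(expr, rowvar=False))
    np.fill_diagonal(weights, 0.0)
    return weights


def binarize(weights, threshold):
    """Binary adjacency keeping the edges whose weight reaches the threshold."""
    adj = (weights >= threshold).astype(int)
    np.fill_diagonal(adj, 0)
    return adj


def density(adj):
    """Fraction of possible edges that are present."""
    n = adj.shape[0]
    return adj.sum() / (n * (n - 1))


def weighted_degree_difference(w_control, w_case):
    """Delta d(g) = d_{-1}(g) - d_{1}(g): weighted degree in controls minus cases."""
    return w_control.sum(axis=1) - w_case.sum(axis=1)


# Synthetic pathway of 5 genes: co-regulated in controls, independent in cases.
rng = np.random.default_rng(0)
shared = rng.normal(size=(40, 1))
controls = shared + 0.1 * rng.normal(size=(40, 5))
cases = rng.normal(size=(40, 5))
w_control, w_case = correlation_network(controls), correlation_network(cases)

for t in (0.2, 0.5, 0.8):  # hypothetical threshold grid
    print(t, density(binarize(w_control, t)), density(binarize(w_case, t)))

delta_d = weighted_degree_difference(w_control, w_case)
print(delta_d)
```

For each threshold of the grid, the two binary matrices would then be passed to the chosen network distance, and the threshold retained is the one that balances the average distance against the density.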

3 DATA DESCRIPTION 

Section 4 describes three different experiments. In the first experiment we used a genome-wide dataset created for investigating the effects
of air pollution on children. In the second and third experiments we analyzed gene expression data on two neurodegenerative diseases:
Parkinson's (PD) and Alzheimer's (AD). All the examples are based on publicly available datasets on the Gene Expression Omnibus (GEO).

3.1 Children susceptibility to air pollution 

The first dataset (GSE7543) collects data of children living in two regions of the Czech Republic with different air pollution levels (van
Leeuwen et al., 2008, 2006): 23 children recruited in the polluted area of Teplice and 24 children living in the cleaner area of Prachatice.
Blood samples were hybridized on Agilent Human 1A 22k oligonucleotide microarrays. After normalization we retained 17564 features.

3.2 Clinical stages of Parkinson's disease 

For PD we consider two publicly available datasets from GEO: GSE6613 (Scherzer et al., 2007) and GSE20295 (Zhang et al., 2005c). The
former includes 22 controls and 50 whole blood samples from patients predominantly at early PD stages, while the latter is composed of 53
controls and 40 patients with late stage PD. Biological data were hybridized on the Affymetrix HG-U133A platform, estimating the expression
of 22215 probesets for each sample.

3.3 Clinical stages of Alzheimer's disease 

For AD we analyzed two GEO datasets: GSE9770 and GSE5281 (Liang et al., 2007, 2008). The first includes 74 controls and 34 samples
from non-demented patients with AD (since it is the earliest AD diagnosed, we will label it as early hereafter) and the second is composed
of 74 controls and 80 samples from patients with late onset AD. The samples were extracted from six brain regions, differently susceptible
to the disease: entorhinal cortex (EC), hippocampus (HIP), middle temporal gyrus (MTG), posterior cingulate cortex (PC), superior frontal
gyrus (SFG) and primary visual cortex (VCX). The latter is known to be relatively spared by the disease, therefore we did not consider the
samples within the VCX region. Overall, we analyzed 62 controls and 29 AD samples for GSE9770 and 62 controls and 68 AD samples for
GSE5281. Biological data were hybridized on the Affymetrix HG-U133Plus2.0 platform, estimating the expression of 54713 probesets for each
sample.

4 DISCUSSION 

4.1 Air Pollution Experiment 

The SRDA analysis of the air pollution dataset was performed within a 100 × 5-fold cross validation (CV) schema, producing a gene signature
characterizing the molecular differences between children in Teplice (polluted) and Prachatice (not polluted). The signature consists of 50
probesets, corresponding to 43 genes, achieving 76% accuracy.

The enrichment analysis on the signature allowed a functional characterization of the relevant genes, identifying 11 enriched ontologies in
GO (listed in Appendix Table 4). We then constructed the corresponding WGCN network for the 11 selected pathways for both cases and
controls. Full details about the experiment are reported in Appendix Section 6.2.1.
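
For readers unfamiliar with the resampling schema mentioned above, the following Python sketch generates the train/test index pairs of a 100 × 5-fold cross validation. Stratifying the folds by class is an assumption of this sketch, as is the function name; the class sizes in the example are those of the air pollution dataset (23 and 24 samples).

```python
import numpy as np


def repeated_stratified_kfold(y, n_repeats=100, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs for n_repeats reshuffled stratified k-fold splits."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    for _ in range(n_repeats):
        folds = [[] for _ in range(n_folds)]
        for label in np.unique(y):
            # Shuffle the samples of each class and deal them across the folds.
            members = rng.permutation(np.flatnonzero(y == label))
            for i, chunk in enumerate(np.array_split(members, n_folds)):
                folds[i].extend(chunk.tolist())
        for i in range(n_folds):
            test = np.array(sorted(folds[i]))
            train = np.array(sorted(j for k, f in enumerate(folds) if k != i for j in f))
            yield train, test


labels = np.array([1] * 23 + [-1] * 24)
splits = list(repeated_stratified_kfold(labels))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each of the resulting splits would be used to train the classifier and rank the features on the training part and to estimate the accuracy on the held-out part.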

Table 1. Air Pollution Experiment: most important pathways ranked by the normalized Ipsen-Mikhailov distance $\epsilon$. The Entrez gene symbol ID is also
provided for the selected probesets $g_1, \ldots, g_k$ in the corresponding pathway.

Pathway Code   $\epsilon$   Gene Symbol
GO:0043066     0.257
GO:0001501     0.149        MATN3
GO:0007399     0.093        NRGN
GO:0016787     0.078        DHX32, CLC
GO:0005516     0.076        MYH1
GO:0007275     0.076        FKHL18, HOXB8, OLIG1
GO:0006954     0.048        PROK2


In Table 1 we report the most biologically relevant pathways, ranked by decreasing normalized Ipsen-Mikhailov distance $\epsilon$, which provides
a measure of the structural distance between the networks inferred for the two classes. The most disrupted pathway is GO:0043066, i.e.
apoptosis, followed by GO:0001501, i.e. skeletal development. Since the children under study are undergoing natural development, especially
physical changes of their skeleton, the high differentiation between cases and controls of GO:0001501 and the involvement of pathway
GO:0007275, i.e. developmental process, is biologically very sound. Another relevant pathway is GO:0006954, representing the response to
infection or injury caused by chemical or physical agents. Several genes included in GO:0005516 (i.e. calmodulin binding) bind or interact
with calmodulin, a calcium-binding protein involved in many essential processes, such as inflammation, apoptosis, nerve growth, and
immune response. This is a key pathway that is linked with all the above mentioned terms as well as to GO:0007399, i.e. nervous system
development, being one of the most stimulated pathways together with GO:0001501.

As described in Section 2, the pipeline also provides a score $\Delta d$ of the variation of the number of interactions for $g_1, \ldots, g_k$. The full list is
provided in Appendix Table 5; here we discuss a subset of the most biologically relevant genes.

FKHL18, HOXB8, PROK2, DHX32 and MATN3 are directly involved in development. CLC is a key element in inflammation and the
immune system. OLIG1 is a transcription factor that works in the oligodendrocytes within the brain. NRGN binds calcium and is a target
for thyroid hormones in the brain. Finally, MYH1 encodes myosin, a major contractile protein found in striated, smooth and
non-muscle cells. MYH1 isoforms show expression that is spatially and temporally regulated during development.

Figure 2 shows the network of the GO:0007399 pathway, related to nervous system development, in the two cohorts. It is clear that
several connections among the genes within this pathway are missing in the subjects living in the polluted area (Teplice). Therefore the
nervous system development in these children is potentially at risk compared to those living in the not polluted city (Prachatice).

Fig. 2. Networks of the pathway GO:0007399 (nervous system development) for Prachatice children (a) compared with Teplice children (b). Node diameter
is proportional to the degree, and edge width is proportional to connection strength (estimated correlation).



4.2 Parkinson's Disease Experiment 

The $\ell_1\ell_2$ analysis of the two PD datasets was performed within a 9-fold nested CV loop for early PD and an 8-fold nested
CV loop for late PD. The early PD signature consists of 77 probesets, mapped on 70 genes, and is associated with 62% accuracy. The late stage
signature is composed of 94 probesets corresponding to 90 genes and achieves 80% accuracy.

Table 2. PD: most important pathways ranked by normalized Ipsen-Mikhailov distance $\epsilon$. The Entrez gene symbol ID is also provided for the selected
probesets $g_1, \ldots, g_k$ in the corresponding pathway. In bold, common elements between early and late stage PD.

Stage      Pathway Code   $\epsilon$   Gene Symbol
PD early   GO:0005506     0.38         HBB
           GO:0006952     0.37         DEFA1/DEFA3
           GO:0045087     0.36
           GO:0042802     0.33         SYT1
           GO:0042802     0.33         HSPB1
           GO:0006955     0.31         DEFA1/DEFA3
           GO:0006950     0.28         HSPB1
           GO:0020037     0.26         HBB
           GO:0005938     0.26         MYH10
           GO:0005856     0.24         VCL
           GO:0005856     0.24         HSPB1
           GO:0003779     0.23         MYH10, VCL
           GO:0030097     0.15
           GO:0009615     0.14
           GO:0009615     0.14         DEFA1/DEFA3
           GO:0051707     0.00
PD late    GO:0019226     0.31
           GO:0007611     0.16
           GO:0042493     0.15
           GO:0009725     0.11
           GO:0030424     0.10         MYH10
           GO:0007267     0.09         TAC1
           GO:0005516     0.09         MYH10, SYT1, RGS4
           GO:0005096     0.09         RGS4
           GO:0007610     0.08
           GO:0003779     0.08         MYH10, VCL
           GO:0005624     0.08         SLC18A2
           GO:0045202     0.08         SYT1
           GO:0003924     0.07         CDC42
           GO:0006928     0.07         HSPB1, VCL
           GO:0042995     0.07         CDC42
           GO:0007268     0.06         TAC1, SYT1
           GO:0043234     0.06         VCL
           GO:0005525     0.05         CDC42
           GO:0006412     0.05         RPS4Y
           GO:0006836     0.05         SLC18A2, SLC6A3
           GO:0043005     0.05         MYH10, SYT1
           GO:0043025     0.04         MYH10
           GO:0042221     0.00
           GO:0009266     0.00
           GO:0014070     0.00



Applying ARACNE, we constructed the relevance network for both cases and controls for the 35 enriched pathways for late stage PD
and the 42 pathways for early stage PD. Table 2 reports the most biologically relevant pathways, ranked by decreasing normalized
Ipsen-Mikhailov distance $\epsilon$. The full list of the analyzed pathways is provided as Appendix Table 6.

Having characterized the functional alteration of pathways for both early and late stage PD, we attempt a comparative analysis of the
outcome, commenting on the most meaningful results from the biological viewpoint. We expected some common pathways between the
two stages, especially among pathways that represent general processes and functions but, as noted in Section 2, the pipeline does
not consider pathways having more than 1000 nodes, hence discarding the general terms in the GO. Indeed, the only common pathway
is GO:0003779, i.e. actin binding. Actin participates in many important cellular processes, including muscle contraction, cell motility,
cell division and cytokinesis, vesicle and organelle movement, and cell signaling. Clearly, this term is strictly associated with the most evident
movement-related symptoms in PD, including shaking, rigidity, slowness of movement and difficulty with walking and gait.

In both early and late PD we note some alteration within the biological process class of response to stimulus. In the early PD list we
identified GO:0006950, i.e. response to stress, GO:0009615, i.e. response to virus, and GO:0051707, i.e. response to other organism. In the late
PD list we found GO:0042493, i.e. response to drug, GO:0009725, i.e. response to hormone stimulus, GO:0042221, i.e. response to chemical
stimulus, GO:0014070, i.e. response to organic cyclic substance, and GO:0009266, i.e. response to temperature stimulus.

The pathways specific to early PD show a great involvement of the immune system, which is greatly stimulated by inflammation especially
located in particular brain regions (mainly substantia nigra). Indeed, we identified: GO:0006952, i.e. defense response, GO:0045087, i.e.
innate immune response (also visualized in Figure 3), GO:0006955, i.e. immune response, and GO:0030097, i.e. hemopoiesis.




Fig. 3. Networks of the pathway GO:0045087 (innate immune response) for PD early development patients (a) compared with healthy subjects (b). Node
diameter is proportional to the degree, and edge width is proportional to connection strength (estimated correlation).



From Figure 3 it is clear that since the early stages of PD the innate immune system is severely compromised: the body is highly subjected
to the invasion and proliferation of microbes (like bacteria or viruses), resulting in a debilitated organism, less effective in fighting the
consequent inflammation.

In late stage PD, we detected several differentiated terms related to the Central Nervous System. Among others, we mention: GO:0019226,
i.e. transmission of nerve impulse, GO:0007611, i.e. learning or memory, GO:0007610, i.e. behavior, and GO:0007268, i.e. synaptic
transmission. These findings fit the late stage PD scenario, where cognitive and behavioral problems may arise with dementia.

Table 2 and Appendix Tables 7 and 8 report the genes belonging to the most relevant pathways for early and late PD, respectively.

Four common genes were identified between early and late stage PD: MYH10, SYT1, VCL and HSPB1. MYH10 is involved in several
pathways: actin binding and calmodulin binding, neural cell body, neuron projection and cell cortex. These pathways indicate that the
damage mostly occurs in the neurons and, in particular, that actin binding and the cell cortex affect the cytoskeleton and the muscular tissue.
At the same time, the calmodulin binding pathway indicates that other processes, related to calmodulin and relevant for PD, might be
damaged. These processes are related to inflammation, metabolism, apoptosis, smooth muscle contraction, intracellular movement,
short-term and long-term memory, nerve growth and the immune response. Moreover, it is known that MYH10 is involved in the regulation of the
actin cytoskeleton pathways but also in those related to axon guidance. Mutations in this gene are known to be present in disease
phenotypes affecting the heart and the brain (Kim et al., 2005). The synaptotagmin SYT1, also involved in calmodulin binding, is
an integral membrane protein of synaptic vesicles thought to serve as a Ca(2+) sensor in the process of vesicular trafficking and exocytosis.
Calcium binding to SYT1 participates in triggering neurotransmitter release at the synapse. This protein is therefore involved in synaptic
transmission and it predominantly works in the neuron projections and synapses. Vinculin (VCL) is a cytoskeletal protein associated with
cell-cell and cell-matrix junctions, where it is thought to function as one of several interacting proteins involved in anchoring F-actin to the
membrane. Defects in VCL are the cause of dilated cardiomyopathy type 1W. This protein is involved in cell motility, proliferation and
differentiation but also in smooth muscle contraction, inflammation and immune surveillance. VCL is located on a locus of chromosome
10 strongly associated with late onset AD (Grupe et al., 2006). HSPB1 is a heat shock protein induced by environmental stress and
developmental changes. The encoded protein is involved in stress resistance and actin organization and translocates from the cytoplasm to
the nucleus upon stress induction. This translocation occurs in order to modulate SP1-dependent transcriptional activity to promote neuronal
protection (Friedman et al., 2009). Furthermore, defects in this gene cause two neuropathic diseases (i.e. Charcot-Marie-Tooth disease type
2F and distal hereditary motor neuropathy).

Besides the common genes, early stage PD is characterized by several meaningful genes. HBB encodes hemoglobin beta that, together
with another hemoglobin beta and two hemoglobin alpha, forms the adult hemoglobin. The work of Atamna and Boyle (2006) shows that the
binding of Abeta to the heme group (hemoglobins bond to iron) supports a unifying mechanism by which excessive Amyloid-beta (Abeta)
induces heme deficiency, causes oxidative damage to macromolecules, and depletes specific neurotransmitters. Although Abeta is a known
marker for AD, a recent publication also places it within a panel of PD biomarkers (Shi et al., 2011). DEFA1 and DEFA3 are both defensins,
a family of microbicidal and cytotoxic peptides thought to be involved in host defense. They are abundant in the granules of neutrophils and
also found in the epithelia of mucosal surfaces such as those of the intestine, respiratory tract, urinary tract, and vagina. Recently, Andrianov
et al. (2007) presented some evidence for the recruitment of defensins in communication between the immune and nervous systems in the
frog.

Among the genes specific to late PD, we note CDC42, a GTPase of the Rho-subfamily, which regulates signaling pathways controlling
diverse cellular functions including cell morphology, migration, endocytosis and cell cycle progression. Through the interaction with other
proteins, CDC42 is known to regulate the polymerization of actin, a constituent both of the cytoskeleton and of the muscle cells. SLC6A3 is a
dopamine transporter which is a member of the sodium- and chloride-dependent neurotransmitter transporter family. This gene is associated
with infantile Parkinsonism-dystonia (Kurian et al., 2009). Other significant genes within Table 2 are: TAC1, RGS4, SLC18A2 and RPS4Y.

4.3 Alzheimer's Disease Experiment 

Classification and feature selection via $\ell_1\ell_2$, performed within a 9-fold nested CV schema for early AD and an 8-fold one for late AD, give
90% and 95% accuracy respectively, with 50 probesets in both cases.

Fig. 4. Networks of the pathway GO:0019787 for AD early development patients (a) compared with healthy subjects (b). Node diameter is proportional to
the degree, and edge width is proportional to connection strength (estimated correlation).



In the AD case we apply the same network analysis strategy as in the PD experiment, inferring for both cases and controls 51 selected
pathways for early stage AD and 34 for late stage AD. The full list of reconstructed pathways is reported in Table 9. In Table 3 we summarize
the main findings discussed hereafter.

Similarly to the PD analysis, we attempt a comparative analysis of the outcome for early and late stage AD having characterized the 
functional alteration of pathways for the two AD stages and comment the most meaningful results from the biological viewpoint. 

Four common pathways were identified: GO:0019226, i.e. transmission of nerve impulse, GO:0008015, i.e. blood circulation, GO:0000267,
i.e. cell fraction, and GO:0042598, i.e. vesicular fraction.

The majority of pathways characterizing early stages of AD are related to the nervous system and the blood. Among the nervous
system related pathways the most damaged are: GO:0007399, i.e. nervous system development, GO:0007417, i.e. central nervous system
development, GO:0042391, i.e. regulation of membrane potential, GO:0042552, i.e. myelination, GO:0050877, i.e. neurological system
process, GO:0001508, i.e. regulation of action potential, and GO:0019226, i.e. transmission of nerve impulse.

The majority of the pathways characterizing late stage AD are related to the cell, to the nervous system and to the response of the organism
to various stimuli, see Tables 3 and 9. Among the pathways centered on the cell, listed in descending order of the number of
genes, there are: GO:0008283, i.e. cell proliferation, GO:0008285, i.e. negative regulation of cell proliferation, GO:0008284, i.e. positive
regulation of cell proliferation, GO:0042127, i.e. regulation of cell proliferation, and GO:0030334, i.e. regulation of cell migration. The pathways
related to the nervous system are: GO:0007268, i.e. synaptic transmission, GO:0007610, i.e. behavior, and GO:0050890, i.e. cognition. Other
relevant nodes are those related to transcription regulation (GO:0016564, GO:0045892), visual perception (GO:0007601), and
heme and lipid binding (i.e. GO:0020037, GO:0008289).

The genes characterizing early stage AD are reported in Tables 3 and 10. UBE2D3 is a ubiquitin, targeting abnormal or short-lived
proteins for degradation. It is a member of the E2 ubiquitin-conjugating enzyme family. This enzyme functions in the ubiquitination of the
tumor-suppressor protein p53. It is also involved in several signaling pathways (BMP, TGF-β, TNF-α/NF-κB and in the immune system) and
in protein processing in the endoplasmic reticulum. PTGDS is an enzyme that catalyzes the conversion of prostaglandin H2 (PGH2) to
prostaglandin D2 (PGD2). It functions as a neuromodulator as well as a trophic factor in the central nervous system; it is also involved
in smooth muscle contraction/relaxation and is a potent inhibitor of platelet aggregation. This gene is preferentially expressed in brain.
Quantifying the protein complex of PGD2 and TTR in CSF may be useful in the diagnosis of AD, possibly in the early stages of the disease
(Lovell et al., 2008). EGFR is a transmembrane glycoprotein that is a member of the protein kinase superfamily. This protein is a receptor
for members of the epidermal growth factor family. Binding of the protein to a ligand induces receptor
dimerization and tyrosine autophosphorylation and leads to cell proliferation. This gene is involved in several pathways related to signaling,
to some types of cancer, to cell proliferation, migration and adhesion and to axon guidance. It is expressed in pediatric brain tumors
(Patereli et al., 2010). NTRK2 is a member of the neurotrophic tyrosine receptor kinase (NTRK) family. This kinase is a membrane-bound
receptor that upon neurotrophin binding phosphorylates itself and members of the MAPK pathway. Signalling through this kinase leads to
cell differentiation. Mutations in this gene have been associated with obesity and mood disorders. SNPs in this gene are associated with AD
(Cozza et al., 2008).

The genes associated with late stage AD are listed in Tables 3 and 11. Although SNCA is a known hallmark for PD, it is also known to be
expressed in late-onset familial AD (Tsuang et al., 2006). Other relevant genes are SPEN, EIF2AK1, CAT, HBD, ATXN1 and XK. The
first gene encodes a hormone-inducible transcriptional repressor. Repression of transcription by this gene product can occur through
interactions with other repressors, by the recruitment of proteins involved in histone deacetylation, or through sequestration of
transcriptional activators. SPEN is involved in the Notch signaling pathway, which is important for cell-cell communication since it involves
gene regulation mechanisms that control multiple cell differentiation processes (i.e. neuronal function and development, stabilization of
arterial endothelial fate and angiogenesis, cardiac valve homeostasis) during embryonic and adult life. EIF2AK1 acts at the level of
translation initiation to downregulate protein synthesis in response to stress; it therefore seems to have a protective role, diminishing the
overproduction of proteins such as SNCA or beta amyloid. CAT encodes catalase, a key antioxidant enzyme in the body's defense against
oxidative stress; it therefore acts against the oxidative stress present in the brain of AD patients. This gene, together with EIF2AK1, seems
to fight against the disease. HBD, like HBB discussed in Subsection 4.2, could play the same role (Atamna and Boyle, 2006). ATXN1 is
involved in the autosomal dominant cerebellar ataxias (ADCA), a heterogeneous group of neurodegenerative disorders characterized by
progressive degeneration of the cerebellum, brain stem and spinal cord. Therefore, because of specific characteristics of these diseases (such
as the affected brain areas and the characteristics of the movement disorders), it might also play a role in AD. Finally, mutations of XK have
been associated with McLeod syndrome, an X-linked recessive disorder characterized by abnormalities in the neuromuscular and
hematopoietic systems.

Table 3. AD: most important pathways ranked by normalized Ipsen-Mikhailov distance ε. The Entrez Gene Symbol is also provided for the selected
probesets g_1, ..., g_k in the corresponding pathway. Common pathways between early and late stage AD are marked with an asterisk.

            Pathway Code    ε       Gene Symbol

AD early    GO:0042598 *    0.21
            GO:0019787      0.16    UBE2D3
            GO:0007417      0.10    MBP
            GO:0001508      0.14
            GO:0051246      0.15    UBE2D3
            GO:0016874      0.12    UBE2D3
            GO:0004842      0.11    UBE2D3
            GO:0005768      0.08    EGFR
            GO:0016567      0.07    UBE2D3
            GO:0050877      0.06
            GO:0042552      0.05
            GO:0008015 *    0.04
            GO:0042391      0.04
            GO:0007399      0.04    NTRK2
            GO:0046982      0.03    EGFR
            GO:0006633      0.02    PTGDS
            GO:0019226 *    0.00
            GO:0000267 *    0.00

AD late     GO:0040012      0.36    SNCA
            GO:0042598 *    0.23
            GO:0019226 *    0.12
            GO:0030334      0.10
            GO:0045892      0.09    SPEN
            GO:0042493      0.06    SNCA
            GO:0042127      0.05
            GO:0008283      0.04    CAT
            GO:0005215      0.03    XK
            GO:0008217      0.03    HBD
            GO:0007601      0.03
            GO:0007268      0.03
            GO:0007610      0.03
            GO:0008289      0.03
            GO:0008015 *    0.02
            GO:0016564      0.02    SPEN, ATXN1
            GO:0008284      0.02
            GO:0008285      0.02    EIF2AK1
            GO:0020037      0.02    EIF2AK1, CAT, HBD
            GO:0000267 *    0.00
            GO:0050890      0.00

5 CONCLUSION 

The theory of complex networks has recently proven to be a helpful tool for gaining systematic and structural knowledge of cell mechanisms.
Here we propose to enhance its capabilities by coupling it with a machine learning driven approach aimed at moving from a global to a
local interaction scale, that is, focusing on the pathways that are most likely to change, for instance within particular pathological stages.
Such a strategy is also better tailored to situations where a small sample size may affect the reliability of network inference on a
global scale. The method, demonstrated on three disease datasets concerning environmental pollution, PD and AD, was able to detect
biologically meaningful differential pathways.
6 APPENDIX

6.1 Experimental setup for the examples 

The presented analysis pipeline is independent of the specific algorithms chosen for each step of the workflow. Here we give some details
about the methods used for the experiments described in Section 4.

6.1.1 Spectral Regression Discriminant Analysis (SRDA). SRDA belongs to the family of Discriminant Analysis algorithms (Cai et al., 2008).
Its peculiarity is that it exploits the regression framework to improve computational efficiency: spectral graph analysis reduces the problem
to solving a set of regularized least squares problems, avoiding the eigenvector computation. A score is assigned to each feature and can be
interpreted as a feature weight, directly allowing feature ranking and selection. The regularization value α is the only parameter that needs
to be tuned. The method is implemented in Python and is available within the mlpy library.

6.1.2 The $\ell_1\ell_2$ feature selection framework ($\ell_1\ell_2$FS). $\ell_1\ell_2$FS with double optimization is a feature selection
method that can be tuned to give a minimal set of discriminative genes or larger sets including correlated genes (Zou and Hastie, 2005;
De Mol et al., 2009). The objective function is a linear model $f(x) = \beta x$, whose sign gives the classification rule that can be used to
associate a new sample to one of the two classes. The sparse weight vector $\beta$ is found by minimizing the $\ell_1\ell_2$ functional
$\|y - \beta X\|_2^2 + \tau \|\beta\|_1 + \mu \|\beta\|_2^2$, where the least squares error is penalized with the $\ell_1$ and $\ell_2$
norms of the coefficient vector $\beta$. The training for selection and classification requires a careful choice of the regularization
parameters for both $\ell_1\ell_2$ and RLS (regularized least squares). Indeed, model selection and statistical significance assessment are
performed within two nested K-fold cross-validation loops, as in Fardin et al. (2009). The framework is implemented in Python and uses the
L1L2Py library.
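
As a minimal sketch of the functional above, the following Python fragment evaluates it and minimizes it by proximal gradient descent (iterative soft thresholding). The solver, the function names and the toy data are illustrative assumptions: they are not the L1L2Py API, and the nested cross-validation used to choose $\tau$ and $\mu$ is not reproduced here. Samples are stored as rows of X, so the linear model is written as X @ beta.

```python
import numpy as np

def l1l2_objective(X, y, beta, tau, mu):
    """Value of ||y - X beta||_2^2 + tau * ||beta||_1 + mu * ||beta||_2^2."""
    residual = y - X @ beta
    return residual @ residual + tau * np.abs(beta).sum() + mu * (beta @ beta)

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: component-wise shrinkage towards zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def l1l2_fit(X, y, tau, mu, n_iter=2000):
    """Minimize the functional by proximal gradient descent (an assumed solver)."""
    # Lipschitz constant of the gradient of the smooth (least squares + l2) part.
    lipschitz = 2.0 * (np.linalg.norm(X, 2) ** 2 + mu)
    step = 1.0 / lipschitz
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ beta - y) + 2.0 * mu * beta
        beta = soft_threshold(beta - step * grad, step * tau)
    return beta

# Hypothetical toy data: 40 samples, 10 features, labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 10))
y = np.sign(3.0 * X[:, 0] - 2.0 * X[:, 1])
beta = l1l2_fit(X, y, tau=8.0, mu=0.1)
selected = np.flatnonzero(beta)      # indices of the features kept in the model
predicted = np.sign(X @ beta)        # classification rule: sign of the linear model
```

Larger values of $\tau$ give sparser weight vectors, while larger values of $\mu$ tend to retain groups of correlated features, which is the trade-off the paragraph above refers to.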



6.1.3 Functional Characterization. Gene Set Enrichment Analysis (GSEA) was performed using WebGestalt, an online toolkit. This web service
takes as input a list of relevant genes/probesets and performs GSEA (Subramanian et al., 2005) in the Kyoto Encyclopedia of Genes and
Genomes (KEGG; Kanehisa and Goto, 2000) and Gene Ontology (GO; Ashburner et al., 2000), identifying the most relevant pathways and
ontologies in the signatures. For both KEGG and GO we selected the WebGestalt human genome as reference set, 0.05 as the level of
significance, 3 as the minimum number of genes and the default hypergeometric test as statistical method.
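
The hypergeometric test named above can be illustrated with a short Python sketch using only the standard library. The function names and the numbers in the example are hypothetical, and any multiple-testing adjustment that WebGestalt may apply is deliberately left out; the fragment only shows the upper-tail probability and the two cut-offs (0.05 and 3 genes) quoted in the text.

```python
from math import comb

def hypergeom_enrichment_pvalue(n_reference, n_in_pathway, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement
    from a reference set of n_reference genes, n_in_pathway of which are annotated
    to the pathway."""
    upper = min(n_in_pathway, n_selected)
    total = comb(n_reference, n_selected)
    tail = sum(
        comb(n_in_pathway, i) * comb(n_reference - n_in_pathway, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total

def is_enriched(n_reference, n_in_pathway, n_selected, n_overlap,
                alpha=0.05, min_genes=3):
    """Apply the minimum-overlap and significance cut-offs quoted in the text."""
    if n_overlap < min_genes:
        return False
    p = hypergeom_enrichment_pvalue(n_reference, n_in_pathway, n_selected, n_overlap)
    return p <= alpha

# Hypothetical example: 5 of 50 selected genes fall in a 100-gene pathway,
# against a reference set of 20000 genes.
p_value = hypergeom_enrichment_pvalue(20000, 100, 50, 5)
enriched = is_enriched(20000, 100, 50, 5)
```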

6.1.4 Weighted Gene Co-Expression Networks (WGCN). WGCN networks are based on the idea of using (a function of) the absolute
correlation between the expression of a pair of genes across the samples to define a link between them. Soft thresholding techniques are
then employed to obtain a binary adjacency matrix, where a suitable biologically motivated criterion (such as the scale-free topology, or some
other prior knowledge) can be adopted (Zhang and Horvath, 2005; Zhao et al., 2010). Due to the very small sample size, scale-freeness
cannot be considered a reliable criterion for threshold selection, so we adopted a different heuristic: for both networks in the two classes,
the selected threshold is the one maximising the average Ipsen-Mikhailov distance on the selected pathways.
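
The construction and the threshold heuristic described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used for the experiments: the power transform of the absolute correlation, the function names and the `distance` argument (any callable comparing two adjacency matrices, for instance an implementation of the distance of Section 6.1.6) are all hypothetical choices.

```python
import numpy as np

def coexpression_adjacency(expr, power=6):
    """Weighted adjacency |cor|^power from a samples x genes expression matrix."""
    corr = np.corrcoef(expr, rowvar=False)
    adj = np.abs(corr) ** power
    np.fill_diagonal(adj, 0.0)          # no self-loops
    return adj

def binarize(adj, threshold):
    """Keep only the links whose weight reaches the threshold."""
    return (adj >= threshold).astype(int)

def select_threshold(adj_case, adj_control, pathways, thresholds, distance):
    """Return the threshold maximising the mean case/control distance,
    averaged over the pathway sub-networks (lists of gene indices)."""
    def mean_distance(t):
        values = []
        for genes in pathways:
            idx = np.ix_(genes, genes)
            values.append(distance(binarize(adj_case[idx], t),
                                   binarize(adj_control[idx], t)))
        return float(np.mean(values))
    return max(thresholds, key=mean_distance)
```

A call such as `select_threshold(adj_case, adj_control, pathways, np.linspace(0.1, 0.9, 9), distance)` would scan a grid of candidate thresholds; the grid itself is also an assumption.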

6.1.5 Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNE). ARACNE is a recent method for inferring
networks from the transcription level (Margolin et al., 2006) to the metabolic level (Nemenman et al., 2007). Although it was originally
designed for handling the complexity of regulatory networks in mammalian cells, it is able to address a wider range of network deconvolution
problems. This information-theoretic algorithm removes the vast majority of indirect candidate interactions inferred by co-expression
methods by using the data processing inequality property (Cover and Thomas, 1991). In this work we use the MiNET (Mutual Information
NETworks) Bioconductor package, keeping the default value for the data processing inequality tolerance parameter (Meyer et al., 2008).
The adopted threshold criterion is the same as the one applied for WGCN.
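
The pruning step based on the data processing inequality can be sketched in plain Python. The fragment assumes a precomputed symmetric matrix of pairwise mutual information and a multiplicative tolerance; both the function name and the way the tolerance enters the comparison are assumptions made for illustration and may differ from the MiNET parameterization.

```python
from itertools import combinations

def apply_dpi(mi, tolerance=0.0):
    """Prune indirect links from a mutual information matrix (list of lists).

    For every fully connected triplet of genes, the weakest of the three links
    is marked as indirect when it is smaller than both other links (scaled by
    1 - tolerance). Returns the set of retained edges (i, j) with i < j.
    """
    n = len(mi)
    edges = {(i, j) for i, j in combinations(range(n), 2) if mi[i][j] > 0}
    to_remove = set()
    for i, j, k in combinations(range(n), 3):
        triplet = [(i, j), (i, k), (j, k)]
        if not all(edge in edges for edge in triplet):
            continue
        weakest = min(triplet, key=lambda e: mi[e[0]][e[1]])
        others = [e for e in triplet if e != weakest]
        weakest_mi = mi[weakest[0]][weakest[1]]
        if all(weakest_mi < mi[a][b] * (1.0 - tolerance) for a, b in others):
            to_remove.add(weakest)
    # Edges are removed only at the end, so every triplet is tested
    # against the original set of candidate interactions.
    return edges - to_remove

# Hypothetical example: gene 0 and gene 2 interact only through gene 1.
mi_matrix = [[0.0, 0.8, 0.3],
             [0.8, 0.0, 0.7],
             [0.3, 0.7, 0.0]]
direct_edges = apply_dpi(mi_matrix)
```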

6.1.6 Ipsen-Mikhailov distance ε. Although already fruitfully used even in a biological context (Sharan and Ideker, 2006), the problem
of quantitatively comparing networks (e.g. using a metric instead of evaluating network properties) is a widely open issue affecting many
scientific disciplines. As discussed in Jurman et al., many classical distances (such as those of the edit family) have a relevant
drawback in being local, that is, focusing only on the portions of the network affected by the differences in the presence/absence of matching
links. More recently proposed metrics can overcome this problem by considering the global structure of the compared topologies; among such
distances, the spectral ones, based on the list of eigenvalues of the Laplacian matrix of the underlying graph, are quite interesting and, in
particular, the Ipsen-Mikhailov distance (Ipsen and Mikhailov, 2002) has been proven to be the most robust in a wide range of situations.

The definition of the ε metric follows the dynamical interpretation of an N-node network as an N-atom molecule connected by identical
elastic strings, where the pattern of connections is defined by the adjacency matrix of the corresponding network (Ipsen and Mikhailov,
2002). The vibrational frequencies $\omega_i$ of the dynamical system are given by the eigenvalues of the Laplacian matrix of the network:
$\lambda_i = \omega_i^2$, with $\lambda_0 = \omega_0 = 0$. The spectral density for a graph is defined as the sum of Lorentz distributions

\[ \rho(\omega) = K \sum_{i=1}^{N-1} \frac{\gamma}{(\omega - \omega_i)^2 + \gamma^2} , \]

where $\gamma$ is the common width (it specifies the half-width at half-maximum, HWHM, equal to half the interquartile range) and K is the
normalization constant, solution of $\int_0^\infty \rho(\omega)\, d\omega = 1$. The spectral distance ε between two graphs G and H with
densities $\rho_G(\omega)$ and $\rho_H(\omega)$ can then be defined as

\[ \epsilon(G, H) = \sqrt{ \int_0^\infty \left[ \rho_G(\omega) - \rho_H(\omega) \right]^2 d\omega } . \]

To get a meaningful comparison of the value of ε on pairs of networks with different number of nodes, we define the normalized version

\[ \hat{\epsilon}(G, H) = \frac{\epsilon(G, H)}{\epsilon(F_n, E_n)} , \]

where $E_n$ and $F_n$ indicate respectively the empty and the fully connected network on n nodes: they are the two most ε-distant networks
for each n. The common width $\gamma$ is set to 0.08 as in the original reference: being a multiplicative factor, it has no impact on
comparing different values of the Ipsen-Mikhailov distance. The network analysis phase is implemented in R through the igraph package.

L1L2Py: http://slipguru.disi.unige.it/Research/L1L2Py
WebGestalt: http://bioinfo.vanderbilt.edu/webgestalt/
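
The definitions of this subsection can be turned into a short numerical sketch. The experiments used R and igraph; the Python version below is only an illustration, and the function names, the integration grid and the truncation of the integral at a finite upper limit are assumptions. The normalization constant K is computed in closed form, since each Lorentzian integrates to $\pi/2 + \arctan(\omega_i/\gamma)$ over $[0, \infty)$. Both graphs are assumed to have the same number n >= 2 of nodes.

```python
import numpy as np

GAMMA = 0.08  # common width, as quoted in the text

def vibrational_frequencies(adjacency):
    """Square roots of the Laplacian eigenvalues, without omega_0 = 0."""
    a = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(a.sum(axis=1)) - a
    eigenvalues = np.linalg.eigvalsh(laplacian)        # ascending order
    eigenvalues = np.clip(eigenvalues, 0.0, None)      # guard against round-off
    return np.sqrt(eigenvalues[1:])

def spectral_density(omega_grid, frequencies, gamma=GAMMA):
    """Normalized sum of Lorentz distributions evaluated on omega_grid."""
    lorentz = gamma / ((omega_grid[:, None] - frequencies[None, :]) ** 2 + gamma ** 2)
    k = 1.0 / np.sum(np.pi / 2.0 + np.arctan(frequencies / gamma))
    return k * lorentz.sum(axis=1)

def ipsen_mikhailov(adj_g, adj_h, gamma=GAMMA, n_points=20001):
    """Spectral distance between two graphs (trapezoidal integration)."""
    w_g = vibrational_frequencies(adj_g)
    w_h = vibrational_frequencies(adj_h)
    upper = max(w_g.max(), w_h.max()) + 50.0           # assumed truncation point
    grid = np.linspace(0.0, upper, n_points)
    diff2 = (spectral_density(grid, w_g, gamma) - spectral_density(grid, w_h, gamma)) ** 2
    integral = np.sum(0.5 * (diff2[1:] + diff2[:-1]) * np.diff(grid))
    return float(np.sqrt(integral))

def normalized_ipsen_mikhailov(adj_g, adj_h, gamma=GAMMA):
    """Distance divided by the distance between the full and the empty graph."""
    n = len(adj_g)
    empty = np.zeros((n, n))
    full = np.ones((n, n)) - np.eye(n)
    return ipsen_mikhailov(adj_g, adj_h, gamma) / ipsen_mikhailov(full, empty, gamma)

# Hypothetical example: a 4-node path compared with the 4-node complete graph.
path = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
complete = np.ones((4, 4)) - np.eye(4)
distance = normalized_ipsen_mikhailov(path, complete)
```

The dense grid makes the sketch suitable for small sub-networks only; for pathways with hundreds of genes a coarser or adaptive integration scheme would be preferable.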

6.2 Experiments 

6.2.1 Air Pollution Experiment. Table 4 lists the 11 enriched pathways identified during the analysis of the air pollution dataset and the
total number of genes belonging to each pathway. The list is ranked by the normalized Ipsen-Mikhailov distance ε (see Section 6.1.6):
the top elements of the list are the most disrupted pathways between the two conditions. The pathways listed in Table 1 are a subset of those
reported in Table 4.

Most of these pathways concern developmental processes: this GO class contains ontologies especially related to the development of the
skeletal and nervous systems (GO:0001501 and GO:0007399), which undergo rapid and constant growth in children. Other enriched terms are
related to the capacity of an organism to defend itself (i.e. response to wounding, GO:0009611, and inflammatory response, GO:0006954), to
the regulation of cell death (i.e. negative regulation of apoptosis, GO:0043066), to the multicellular organismal process (GO:0032501), to the
glycerolipid metabolic process (GO:0046486), to the response to external stimuli (i.e. inflammatory response, response to wounding) and to
locomotion (i.e. GO:0040011, GO:0007626).

Table 4. Air Pollution Experiment: pathways corresponding to mostly discriminant genes g_1, ..., g_k ranked by the normalized Ipsen-Mikhailov distance ε.
The number of genes belonging to the pathway is also provided.

Pathway       ε        # Genes

GO:0043066    0.257    21
GO:0001501    0.149    89
GO:0009611    0.123    16
GO:0007399    0.093    252
GO:0016787    0.078    718
GO:0005516    0.076    116
GO:0007275    0.076    453
GO:0006954    0.048    180
GO:0005615    0.038    417
GO:0007626    0.000    5
GO:0006066    0.000    8


Table 5 provides the subset of Agilent probesets (together with their corresponding Gene Symbol and GO pathway) belonging to the
signature g_1, ..., g_k and having a nonzero value of the differential node degree Δd. Since the Δd score is computed as the difference
between the weighted degrees in the two classes, the top elements in Table 5 are those whose number of interactions varies most between the
two conditions.
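
As a small illustration of the score just described, the following Python sketch computes Δd from two weighted adjacency matrices defined on the same genes and ranks the genes by decreasing absolute value. The sign convention (case minus control), the function names and the toy matrices are assumptions made for the example; the text does not state the direction of the difference.

```python
import numpy as np

def differential_degree(adj_case, adj_control):
    """Weighted degree in the case network minus weighted degree in the control one."""
    case = np.asarray(adj_case, dtype=float)
    control = np.asarray(adj_control, dtype=float)
    return case.sum(axis=1) - control.sum(axis=1)

def rank_by_differential_degree(genes, adj_case, adj_control):
    """Genes with nonzero differential degree, by decreasing absolute value."""
    delta = differential_degree(adj_case, adj_control)
    order = np.argsort(-np.abs(delta), kind="stable")
    return [(genes[i], float(delta[i])) for i in order if delta[i] != 0]

# Hypothetical three-gene example.
genes = ["A", "B", "C"]
adj_case = [[0.0, 1.0, 0.5], [1.0, 0.0, 0.0], [0.5, 0.0, 0.0]]
adj_control = [[0.0, 0.25, 0.5], [0.25, 0.0, 0.75], [0.5, 0.75, 0.0]]
ranking = rank_by_differential_degree(genes, adj_case, adj_control)
```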
Table 5. Air Pollution Experiment: list of Agilent probesets in the signature with their corresponding Entrez Gene Symbol and GO pathway. The list is
ranked according to the decreasing absolute value of the differential node degree Δd.

Agilent ID    Gene Symbol    Pathway       Δd

4701          NRGN           GO:0007399    -2.477
12235         DUSP15         GO:0016787    -1.586
[illegible]   [illegible]    [illegible]   [illegible]
3697          ITGB5          GO:0007275    -1.390
4701          NRGN           GO:0005516    -1.357
12537         PROK2          GO:0006954    1.069
13835         OLIG1          GO:0007275    .834
11673         HOXB8          GO:0007275    -0.750
16424         FKHL18         GO:0007275    -0.685
13094         DHX32          GO:0016787    -0.575
8944          CLC            GO:0007275    .561
14787         MATN3          GO:0001501    .495
15797         CXCL1          GO:0006954    0.467
15797         CXCL1          GO:0005615    .338
11302         MYH1           GO:0005516    -0.194
15797         CXCL1          GO:0007399    .131

6.2.2 Parkinson's Disease Experiment. Table 6 reports the list of the pathways selected by the presented approach for both the early and
late PD cases, ranked by the normalized distance ε. Some of these pathways are also reported in Table 2.



Table 6. PD Experiment: selected pathways for early (left) and late (right) stage corresponding to mostly discriminant genes g_1, ..., g_k ranked by the
normalized Ipsen-Mikhailov distance ε. The number of genes belonging to the pathway is also provided. The common pathways are marked with an asterisk.

PD early                              PD late
Pathway         ε       # Genes       Pathway         ε       # Genes

GO:0012501      0.49    4             GO:0019226      0.31    20
GO:0005764      0.39    257           GO:0010033      0.20    30
GO:0019901      0.38    116           GO:0007611      0.16    34
GO:0005506      0.38    434           GO:0030234      0.15    20
GO:0008219      0.38    110           GO:0042493      0.15    109
GO:0016323      0.37    111           GO:0032403      0.12    14
GO:0006952      0.37    160           GO:0019717      0.12    79
GO:0046983      0.36    153           GO:0009725      0.11    27
GO:0045087      0.36    112           GO:0030424      0.10    93
GO:0046914      0.35    51            GO:0005096      0.09    252
GO:0016265      0.33    6             GO:0007267      0.09    264
GO:0042802      0.33    473           GO:0050790      0.09    15
GO:0042803      0.32    411           GO:0019001      0.09    34
GO:0050896      0.31    213           GO:0017111      0.09    157
GO:0006955      0.31    778           GO:0007585      0.09    47
GO:0006915      0.31    687           GO:0005516      0.09    215
GO:0042981      0.30    206           GO:0005626      0.09    41
GO:0030218      0.29    33            GO:0045202      0.08    278
GO:0006950      0.28    253           GO:0007610      0.08    40
GO:0020037      0.26    176           GO:0005624      0.08    616
GO:0005938      0.26    50            GO:0043087      0.08    22
GO:0005856      0.24    816           GO:0003779 *    0.08    423
GO:0016567      0.23    103           GO:0008047      0.07    60
GO:0003779 *    0.23    431           GO:0042995      0.07    231
GO:0042592      0.22    9             GO:0006928      0.07    166
GO:0051607      0.21    26            GO:0003924      0.07    294
GO:0016564      0.18    229           GO:0007568      0.06    35
GO:0005200      0.16    127           GO:0043234      0.06    233
GO:0030097      0.15    76            GO:0007268      0.06    201
GO:0009615      0.14    111           GO:0030030      0.05    27
GO:0008092      0.12    77            GO:0005525      0.05    450
GO:0030099      0.07    19            GO:0006412      0.05    466
GO:0019900      0.04    32            GO:0043005      0.05    51
GO:0034101      0.00    8             GO:0006836      0.05    42
GO:0051707      0.00    5             GO:0043025      0.04    82
                                      GO:0042221      0.00    16
                                      GO:0009266      0.00    6
                                      GO:0014070      0.00    13
                                      GO:0046578      0.00    8
                                      GO:0050804      0.00    11
                                      GO:0017076      0.00    7


The only common pathway between early and late stage PD is actin binding (GO:0003779), as commented in Section 4. The pathways specific
to early stage PD concern the immune system (i.e. GO:0045087, GO:0006955), the response to stimulus (i.e. stress or other organisms
such as viruses, GO:0006950, GO:0009615), the regulation of metabolic processes, the biological quality and cell death. The pathways
specific to the late stage are related to the nervous system (e.g. neurotransmitter transport, transmission of nerve impulse, learning or memory,
GO:0006836, GO:0019226) and to the response to stimuli (e.g. behavior, temperature, organic substances, drugs or endogenous stimuli).

Figure 5 visualizes the enriched pathways in the Molecular Function and Biological Process domains. Although only one pathway was found
to be common between early and late PD, it is easy to note that the majority of the selected pathways belong to common GO classes.




(a) MF (b) BP

Fig. 5. GO subgraphs for Parkinson's early and late stage (Molecular Function and Biological Process domains). Selected nodes are represented in light
gray, gray and dark gray for late, early and common nodes, respectively.

Tables 7 and 8 list, respectively, the subsets of elements of the early PD signature and of the late PD signature having nonzero differential
node degree Δd. We recall that the top elements in the two tables are those whose number of interactions varies most between the two
case/control conditions.

In Table 7 we note that the most disrupted genes for early PD (IFI44L, HSPB1, MAFF, DEFA1/DEFA3, OXR1) belong to pathways
related to response to stress and to virus. Moreover, several genes (HLA-DQB1, HBB, HBA1/HBA2, DEFA1/DEFA3) are related to the
following pathways: heme binding, iron ion binding and immune response.

In Table 8 the majority of disrupted genes for late PD (RGS4, CDC42, RABGAP1L) occur in pathways that are related to GTP, a purine
nucleotide that can function both as a source of energy for protein synthesis and in signal transduction, particularly with G-proteins. Other
genes in the list (MYH10, RGS4, PHACTR1, SYT1, VCL) are related to actin and calmodulin binding, synaptic transmission,
neurotransmitter transport, cell-cell signaling, translation and cellular component movement. The majority of the disrupted
pathways are located in several components of the neurons.
Table 7. PD Experiment (early): list of Affymetrix probesets in the early stage signature with their corresponding Entrez Gene Symbol and GO pathway.
The list is ranked according to the decreasing absolute value of the differential node degree Δd.

Affy Probeset ID    Gene Symbol    Pathway       Δd

200931_s_at         VCL            GO:0005856    -2.124
200931_s_at         VCL            GO:0003779    -2.107
213067_at           MYH10          GO:0003779    1.879
202887_s_at         DDIT4          GO:0006915    -1.872
201841_s_at         HSPB1          GO:0042802    -1.691
204439_at           IFI44L         GO:0006955    -1.585
201841_s_at         HSPB1          GO:0005856    -1.532
209480_at           HLA-DQB1       GO:0006955    -1.340
209116_x_at         HBB            GO:0005506    -1.008
[illegible]         [illegible]    [illegible]   [illegible]
36711_at            MAFF           GO:0006950    -0.807
209116_x_at         HBB            GO:0020037    -0.599
36711_at            MAFF           GO:0046983    0.567
214414_x_at         HBA1/HBA2      GO:0020037    -0.376
205033_s_at         DEFA1/DEFA3    GO:0009615    -0.360
217232_x_at         HBB            GO:0005506    -0.239
205033_s_at         DEFA1/DEFA3    GO:0006955    -0.228
213067_at           MYH10          GO:0005938    0.183
214414_x_at         HBA1/HBA2      GO:0005506    -0.182
201841_s_at         HSPB1          GO:0006950    -0.154
217232_x_at         HBB            GO:0020037    -0.088
218197_s_at         OXR1           GO:0006950    -0.059
205033_s_at         DEFA1/DEFA3    GO:0006952    0.027

Table 8. PD Experiment (late): list of Affymetrix probesets in the late stage signature with their corresponding Entrez Gene Symbol and GO pathway. The
list is ranked according to the decreasing absolute value of the differential node degree Δd.

Affy Probeset ID    Entrez Gene Symbol    Pathway       Δd

213638_at           PHACTR1               GO:0045202    -3.255
213067_at           MYH10                 GO:0003779    -3.252
213067_at           MYH10                 GO:0005516    -2.597
204337_at           RGS4                  GO:0005096    -2.194
213067_at           MYH10                 GO:0043025    -2.107
213067_at           MYH10                 GO:0043005    -1.696
214230_at           CDC42                 GO:0003924    1.677
213638_at           PHACTR1               GO:0003779    -1.587
213067_at           MYH10                 GO:0030424    -1.170
205857_at           SLC18A2               GO:0006836    -1.094
206552_s_at         TAC1                  GO:0007268    -0.834
206552_s_at         TAC1                  GO:0007267    0.809
205110_s_at         FGF13                 GO:0007267    -0.804
203998_s_at         SYT1                  GO:0005516    -0.787
208319_s_at         RBM3                  GO:0006412    -0.759
201909_at           RPS4Y                 GO:0006412    -0.688
200931_s_at         VCL                   GO:0043234    -0.655
204337_at           RGS4                  GO:0005516    -0.602
205105_at           MAN2A1                GO:0007585    -0.502
205857_at           SLC18A2               GO:0005624    -0.428
201841_s_at         HSPB1                 GO:0006928    0.424
214230_at           CDC42                 GO:0042995    -0.379
214230_at           CDC42                 GO:0005525    0.370
203998_s_at         SYT1                  GO:0043005    -0.357
203998_s_at         SYT1                  GO:0045202    -0.339
200931_s_at         VCL                   GO:0003779    -0.311
200931_s_at         VCL                   GO:0006928    -0.308
206836_at           SLC6A3                GO:0006836    -0.238
215342_s_at         RABGAP1L              GO:0005096    -0.211
211727_s_at         COX11                 GO:0007585    0.188
203998_s_at         SYT1                  GO:0007268    -0.159

6.2.3 Alzheimer's Disease Experiment. Table 9 reports the most discriminant pathways for the two AD stages as selected by the
presented pipeline, ranked by decreasing normalized ε distance. Table 3 summarizes the main results detailed here in Tables 9, 10 and
11. The common pathways are: GO:0019226, i.e. transmission of nerve impulse; GO:0008015, i.e. blood circulation; GO:0000267, i.e. cell
fraction; and GO:0042598, i.e. vesicular fraction. The relevance of the blood circulatory system in AD has already been highlighted in Brown
and Thore (2011) and references therein.

Figure 6 visualizes the enriched pathways in the Molecular Function and Biological Process domains. Although only 4 pathways were found
to be common between early and late AD, it is easy to note that the majority of the selected pathways belong to common GO classes.




(a) MF (b) BP 



Fig. 6. GO subgraphs for Alzheimer's early and late stage (Molecular Function and Biological Processes domains). Selected nodes are represented in light 
gray, gray and dark gray for late, early and common nodes. 

Tables 10 and 11 provide details of the network analysis results on early and late stage AD, respectively. The elements of the two signatures
having nonzero Δd are listed by decreasing absolute value of the differential node degree score, thus giving top positions to the genes whose
interaction network changes most between the case/control conditions.

Table 10 reports the most disrupted probesets within the early stage AD, ranked according to the differential node degree Δd. We note that
the most disrupted gene is HBB, within regulation of blood vessel size (GO:0050880) and regulation of blood pressure (GO:0008217).

Table 11 reports the most disrupted genes within the late stage AD, ranked according to the differential node degree Δd. The majority
of such genes (SPEN, SNCA, EIF2AK1, ELF1, CAT, ATXN1, HBD) belong to regulation of locomotion, transcription repressor activity,
response to drug and heme binding.
Table 9. AD Experiment: selected pathways for early (left) and late (right) stage corresponding to mostly discriminant genes g_1, ..., g_k ranked by the
normalized Ipsen-Mikhailov distance ε. The number of genes belonging to the pathway is also provided. The common pathways are marked with an asterisk.

AD early                              AD late
Pathway         ε       # Genes       Pathway         ε       # Genes

GO:0048514      0.22    22            GO:0040012      0.36    9
GO:0042598 *    0.21    16            GO:0042598 *    0.23    16
GO:0016881      0.19    109           GO:0019226 *    0.12    27
GO:0019787      0.16    116           GO:0030334      0.10    93
GO:0019725      0.16    14            GO:0045892      0.09    218
GO:0051246      0.15    121           GO:0009968      0.06    107
GO:0001508      0.14    31            GO:0042493      0.06    160
GO:0006631      0.14    171           GO:0050877      0.06    31
GO:0030234      0.13    29            GO:0042127      0.05    140
GO:0016874      0.12    735           GO:0009725      0.05    47
GO:0004842      0.11    368           GO:0042277      0.05    63
GO:0007417      0.10    199           GO:0015630      0.05    99
GO:0012505      0.10    216           GO:0008283      0.04    785
GO:0050880      0.09    26            GO:0005819      0.04    142
GO:0048471      0.08    263           GO:0008217      0.03    106
GO:0005792      0.08    409           GO:0005626      0.03    68
GO:0005768      0.08    490           GO:0000165      0.03    94
GO:0004857      0.08    57            GO:0005215      0.03    685
GO:0031982      0.07    34            GO:0007268      0.03    377
GO:0016567      0.07    206           GO:0007601      0.03    402
GO:0008217      0.07    105           GO:0008289      0.03    285
GO:0001666      0.07    225           GO:0007610      0.03    84
GO:0030141      0.06    69            GO:0008284      0.02    507
GO:0050877      0.06    31            GO:0001503      0.02    171
GO:0042552      0.05    36            GO:0007243      0.02    220
GO:0001568      0.05    79            GO:0008285      0.02    578
GO:0048511      0.04    49            GO:0008015 *    0.02    103
GO:0016023      0.04    108           GO:0016564      0.02    380
GO:0007399      0.04    806           GO:0020037      0.02    265
GO:0008015 *    0.04    103           GO:0051270      0.00    9
GO:0042391      0.04    67            GO:0010033      0.00    44
GO:0031410      0.03    482           GO:0050890      0.00    31
GO:0046982      0.03    364           GO:0050953      0.00    24
GO:0006633      0.02    109           GO:0000267 *    0.00    5
GO:0045121      0.02    136
GO:0004866      0.02    194
GO:0008366      0.00    22
GO:0019228      0.00    19
GO:0006873      0.00    10
GO:0042592      0.00    25
GO:0001974      0.00    28
GO:0019226 *    0.00    27
GO:0001944      0.00    4
GO:0048771      0.00    12
GO:0048856      0.00    20
GO:0019838      0.00    85
GO:0017076      0.00    11
GO:0030414      0.00    42
GO:0001882      0.00    8
GO:0000267 *    0.00    4
GO:0031090      0.00    6




Table 10. AD Experiment (early): list of Affymetrix probesets in the early stage signature with their corresponding Entrez Gene Symbol and GO pathway.
The list is ranked according to the decreasing absolute value of the differential node degree Δd.

Affy Probeset ID    Gene Symbol    Pathway       Δd

209116_x_at         HBB            GO:0050880    1.670
209116_x_at         HBB            GO:0008217    1.445
211748_x_at         PTGDS          GO:0006633    1.273
240383_at           UBE2D3         GO:0016874    -1.165
240383_at           UBE2D3         GO:0019787    -0.703
[illegible]         [illegible]    [illegible]   [illegible]
240383_at           UBE2D3         GO:0051246    -0.613
201983_s_at         EGFR           GO:0046982    -0.476
221795_at           NTRK2          GO:0007399    -0.262
212226_s_at         PPAP2B         GO:0001568    0.259
201983_s_at         EGFR           GO:0005768    -0.256
211696_x_at         HBB            GO:0050880    -0.224
209072_at           MBP            GO:0008366    0.166
211696_x_at         HBB            GO:0008217    -0.149
212187_x_at         PTGDS          GO:0006633    -0.139
201185_at           HTRA1          GO:0019838    .124
240383_at           UBE2D3         GO:0004842    .120
209072_at           MBP            GO:0007417    0.113
240383_at           UBE2D3         GO:0016567    -0.047



Table 11. AD Experiment (late): list of Affymetrix probesets in the late stage signature with their corresponding Entrez Gene Symbol and GO pathway. The
list is ranked according to the decreasing absolute value of the differential node degree Δd.

Affy Probeset ID    Gene Symbol    Pathway       Δd

201996_s_at         SPEN           GO:0016564    1.590
211546_x_at         SNCA           GO:0040012    1.410
211546_x_at         SNCA           GO:0042493    1.310
201996_s_at         SPEN           GO:0045892    1.246
217736_s_at         EIF2AK1        GO:0020037    -1.066
201005_at           CD9            GO:0008285    0.725
210943_s_at         LYST           GO:0015630    0.706
204466_s_at         SNCA           GO:0042493    0.461
207827_x_at         SNCA           GO:0040012    0.434
206698_at           XK             GO:0005215    0.433
209184_s_at         IRS2           GO:0008283    0.208
212420_at           ELF1           GO:0016564    -0.203
207827_x_at         SNCA           GO:0042493    0.201
205592_at           SLC4A1         GO:0005215    0.180
211922_s_at         CAT            GO:0008283    0.173
211922_s_at         CAT            GO:0020037    -0.094
203231_s_at         ATXN1          GO:0016564    -0.073
217736_s_at         EIF2AK1        GO:0008285    -0.072
204466_s_at         SNCA           GO:0040012    .048
206834_at           HBD            GO:0008217    0.045
206834_at           HBD            GO:0020037    0.019